Abstract

Many techniques in computer vision, machine learning, and statistics rely on the fact that a signal of interest 
admits a sparse representation over some dictionary. Dictionaries are either available analytically, or can be 
learned from a suitable training set. While analytic dictionaries capture the global structure of a signal
and allow a fast implementation, learned dictionaries often perform better in applications as they are better adapted
to the considered class of signals. In imaging, unfortunately, the numerical burden of (i) learning a dictionary and
(ii) employing the dictionary in reconstruction tasks restricts these methods to relatively small image patches
that only capture local image information.

The approach presented in this paper aims at overcoming these drawbacks by allowing a separable structure on
the dictionary throughout the learning process. On the one hand, this permits larger patch sizes in the learning
phase; on the other hand, the dictionary can be applied efficiently in reconstruction tasks. The learning procedure
is based on optimizing over a product of spheres, which updates the dictionary as a whole and thus enforces basic
dictionary properties such as mutual coherence explicitly during the learning procedure. In the special case where
no separable structure is enforced, our method competes with state-of-the-art dictionary learning methods like
K-SVD.

1. Introduction 

Exploiting the fact that a signal $s \in \mathbb{R}^n$ has a sparse representation over some dictionary $D \in \mathbb{R}^{n \times d}$ is the
backbone of many successful signal reconstruction and data analysis algorithms. Having a sparse representation
means that $s$ is the linear combination of only a few columns of $D$, referred to as atoms. Formally, this reads as

$$s = Dx, \qquad (1)$$

where the transform coefficient vector $x \in \mathbb{R}^d$ is sparse, i.e. most of its entries are zero or small in magnitude.
For the performance of algorithms exploiting this model, it is crucial to find a dictionary that allows the signal 
of interest to be represented most accurately with a coefficient vector x that is as sparse as possible. Basically, 
dictionaries can be assigned to two classes: analytic dictionaries and learned dictionaries. Analytic dictionaries 
are built on mathematical models of a general type of signal they should represent. They can be used universally 
and allow a fast implementation. Popular examples include Wavelets [ ], Bandlets [ ], and Curvelets [19] among
several others. It is well known that learned dictionaries yield a sparser representation than analytic ones. Given
a set of representative training signals, dictionary learning algorithms aim at finding the dictionary over which
the training set admits a maximally sparse representation. Formally, let $S = [s_1, \dots, s_m] \in \mathbb{R}^{n \times m}$ be the matrix
containing the $m$ training samples arranged as its columns, and let $X = [x_1, \dots, x_m] \in \mathbb{R}^{d \times m}$ contain the
corresponding $m$ sparse transform coefficient vectors. Learning a dictionary can then be stated as the minimization
problem

$$\min_{X, D} \; g(X) \quad \text{subject to} \quad \|DX - S\|_F^2 \le \epsilon, \quad D \in \mathcal{C}. \qquad (2)$$

Therein, $g : \mathbb{R}^{d \times m} \to \mathbb{R}$ is a function that promotes sparsity, $\epsilon$ reflects the noise power, and $\mathcal{C}$ is some predefined
admissible set of solutions. Common dictionary learning approaches employing optimization problems related
to (2) include probabilistic ones like [11, 14, 26], and clustering based ones such as K-SVD [ ], see [ ] for 
a more comprehensive overview. The dictionaries produced by these techniques are unstructured matrices that 
allow highly sparse representations of the signals of interest. However, the dimension of the signals which are 
sparsely represented and, consequently, the possible dictionaries' dimensions are inherently restricted by limited 
memory and limited computational resources. Furthermore, when used within signal reconstruction algorithms 
where many matrix vector multiplications have to be performed, those dictionaries are computationally expensive 
to apply. 

In this paper, we present a method for learning dictionaries that are efficiently applicable in reconstruction tasks. 
The crucial idea is to allow the dictionary to have a separable structure, where separable means that the dictionary 
$D$ is given by the Kronecker product of two smaller dictionaries $A \in \mathbb{R}^{h \times a}$ and $B \in \mathbb{R}^{w \times b}$, i.e.

$$D = B \otimes A. \qquad (3)$$

The relation between a signal $s \in \mathbb{R}^{hw}$ and its sparse representation $x \in \mathbb{R}^{ab}$ as given in (1) is accordingly
$s = (B \otimes A)x = \operatorname{vec}(A \operatorname{vec}^{-1}(x) B^\top)$, where the vector space isomorphism $\operatorname{vec}: \mathbb{R}^{a \times b} \to \mathbb{R}^{ab}$ is defined
as the operation that stacks the columns on top of each other. Employing this separable structure instead of
a full, unstructured dictionary clearly reduces the computational costs of both the learning algorithm and the
reconstruction tasks. More precisely, for a separation with $h, w \sim \sqrt{n}$, the computational burden reduces from
$O(n)$ to $O(\sqrt{n})$. We will refer to this new learning approach as SeDiL (Separable Dictionary Learning).
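The identity $(B \otimes A)x = \operatorname{vec}(A \operatorname{vec}^{-1}(x) B^\top)$ can be checked numerically. The following is a minimal NumPy sketch; the matrix sizes and the random data are illustrative assumptions and are not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, a, b = 6, 5, 8, 7          # illustrative sizes (assumed)
A = rng.standard_normal((h, a))
B = rng.standard_normal((w, b))
X = rng.standard_normal((a, b))  # coefficient matrix, x = vec(X)


def vec(M):
    """Stack the columns of M on top of each other (column-major order)."""
    return M.reshape(-1, order="F")


# Full dictionary: an (h*w) x (a*b) matrix has to be stored and applied.
D = np.kron(B, A)
s_full = D @ vec(X)

# Separable application: only the two small factors A and B are needed.
s_sep = vec(A @ X @ B.T)

identity_holds = np.allclose(s_full, s_sep)
```

The separable form never builds the large matrix `D`; it only multiplies by the two small factors.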

It is apparent that this approach applies in principle to any class of signals. However, we will focus on signals
that have an inherently two-dimensional structure, such as images. It is worth mentioning that SeDiL
can straightforwardly be extended to signals with higher dimensional structure, such as volumetric 3D signals, by
employing multiple Kronecker products. To fix the notation for the rest of this work, if $A$ and $B$ are as above, the
two-dimensional signal $S \in \mathbb{R}^{h \times w}$ has the sparse representation $X \in \mathbb{R}^{a \times b}$, i.e. $S = A X B^\top$.

The proposed dictionary learning scheme SeDiL is based on an adaptation of Problem (2) to a product of unit
spheres. Furthermore, it incorporates a regularization term that allows controlling the dictionary's mutual coherence.
The arising optimization problem is solved by a Riemannian conjugate gradient method combined with a
nonmonotone line search. For the general separable case, the method is able to learn dictionaries for large patch
dimensions where conventional learning techniques fail, while for $B = 1$ SeDiL yields a new algorithm
for learning standard unstructured dictionaries. A denoising experiment is given that shows the performance of
both a separable and a non-separable dictionary learned by SeDiL on $(8 \times 8)$-dimensional image patches. From
this experiment it can be seen that the separable dictionary outperforms its analytic counterpart, the overcomplete
discrete cosine transform, and the non-separable one achieves performance similar to state-of-the-art learning
methods like K-SVD. In addition, to show that a learned separable dictionary is able to extract and recover the
global information contained in the training data, a separable dictionary is learned on a face database with each
face image having a resolution of $64 \times 64$ pixels. This dictionary is then applied in a face inpainting experiment
where large missing regions are recovered solely based on the information contained in the dictionary.

2. Structured Dictionary Learning 

Instead of learning dense unstructured dictionaries, which are costly to apply in reconstruction tasks and are
unable to deal with high dimensional signals, techniques exist that aim at learning dictionaries which bypass
these limitations. In the following, we briefly review some existing techniques that focus on learning efficiently
applicable and high dimensional dictionaries, followed by an introduction of our approach.

2.1. Related Work 

In [17] and [24], two different algorithms have been proposed following the same idea of finding a dictionary
such that the atoms themselves are sparse over some fixed analytic base dictionary. The algorithm proposed in
[17] forces each atom to have a fixed number of non-zero coefficients, while the one suggested in [24] imposes
a less restrictive constraint by enforcing sparsity over the entire dictionary. However, both algorithms employ
optimization problems that are not capable of finding a large dictionary for high dimensional signals. In [2] an
alternative structure for dictionaries has been proposed. The so-called signature dictionary is a small image itself,
where every patch at varying locations and sizes is a possible dictionary atom. The advantages of this structure
include near-translation-invariance, reduced overfitting, and lower memory and computational requirements
compared to unstructured dictionary approaches. However, the small number of parameters in this model also makes
this dictionary more restrictive than other structures. This approach has been further extended in [ ] to learn real
translation-invariant atoms. Hierarchical frameworks for tackling high dimensional dictionary learning are presented
in [ ] and [ ]. The latter work uses this framework in conjunction with a screening technique and random
projections. We note that our approach has the potential to be combined with hierarchical frameworks.

2.2. Proposed Approach 

We aim at learning a separable dictionary $D = B \otimes A$ from a given set of training samples $\mathcal{S} = (S_1, \dots, S_m) \in \mathbb{R}^{h \times w \times m}$
by solving a problem related to (2). We denote the collection of the $m$ sparse representations by $\mathcal{X} = (X_1, \dots, X_m)$
and measure its overall sparsity via

$$g(\mathcal{X}) := \sum_{j=1}^{m} \sum_{k=1}^{a} \sum_{l=1}^{b} \ln\left(1 + |\rho\, x_{klj}|^2\right), \qquad (4)$$

where $x_{klj}$ is the $(k, l)$-entry of $X_j \in \mathbb{R}^{a \times b}$ and $\rho > 0$ is a weighting factor. We impose the following regularization
on the dictionary.

(i) The columns of D have unit Euclidean norm. 

(ii) The coherence of D shall be moderate. 

Constraint (i) is commonly employed in various dictionary learning procedures to avoid the scale ambiguity problem,
i.e. the entries of $D$ tend to infinity while the entries of $\mathcal{X}$ tend to zero, as this is the global minimizer of the
unconstrained sparsity measure $g(\mathcal{X})$. Matrices with normalized columns admit a manifold structure, known as
the product of spheres, which we denote by

$$S(n, d) := \{D \in \mathbb{R}^{n \times d} \mid \operatorname{ddiag}(D^\top D) = I_d\}. \qquad (5)$$

Here, $\operatorname{ddiag}(Z)$ forms a diagonal matrix with the diagonal entries of the square matrix $Z$, and $I_d$ is the $(d \times d)$-identity
matrix. Consequently, we require that $A$ is an element of $S(h, a)$ and that $B$ is an element of $S(w, b)$.

The soft constraint (ii) of requiring a moderate mutual coherence of the dictionary is a well known regularization 
procedure in dictionary learning, and is motivated by the compressive sensing theory. Roughly speaking, the 
mutual coherence of $D$ measures the similarity between the dictionary's atoms, or, "a value that exposes the
dictionary's vulnerability, as [...] two closely related columns may confuse any pursuit technique." [10]. The most
common mutual coherence measure for a dictionary $D$ with normalized columns is

$$\mu(D) := \max_{i < j} |d_i^\top d_j|. \qquad (6)$$

For the rest of this paper we will follow this notation and denote the $i$-th column of a matrix $Q$ by the corresponding
lower case character $q_i$. In order to relax this worst case measure, other measures have been introduced in the
literature that are better suited for practical purposes, for example averaging the largest entries of $\{|d_i^\top d_j| \mid i < j\}$
as in [8, 10, 21], or considering the sum of squares of all elements in $\{|d_i^\top d_j| \mid i < j\}$, cf. [9]. In this work, we
introduce an alternative mutual coherence measure, which has proven extremely useful in our experiments.
Explicitly, we measure the mutual coherence via

$$r(D) := -\sum_{1 \le i < j \le d} \ln\left(1 - (d_i^\top d_j)^2\right). \qquad (7)$$

Since this measure is differentiable, it can be integrated into smooth optimization procedures. Furthermore, when
it is used within a dictionary learning scheme, the log-barrier function prevents the algorithm from producing
dictionaries that contain repeated identical atoms.
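Both coherence measures are cheap to evaluate from the Gram matrix $D^\top D$. The sketch below is a minimal NumPy illustration; the function names and the random test dictionary are assumptions made for this example only. It also compares the largest single log-barrier term, $-\ln(1 - \mu(D)^2)$, with $r(D)$ and with $r(D)/N$, where $N = d(d-1)/2$ is the number of atom pairs.

```python
import numpy as np


def normalize_columns(D):
    """Scale every column to unit Euclidean norm (a point on the product of spheres)."""
    return D / np.linalg.norm(D, axis=0, keepdims=True)


def mutual_coherence(D):
    """Classical measure: largest absolute inner product between two distinct atoms."""
    G = D.T @ D
    upper = np.triu_indices(D.shape[1], k=1)
    return np.max(np.abs(G[upper]))


def log_barrier_coherence(D):
    """Log-barrier measure: -sum_{i<j} ln(1 - (d_i^T d_j)^2)."""
    G = D.T @ D
    upper = np.triu_indices(D.shape[1], k=1)
    return -np.sum(np.log1p(-G[upper] ** 2))


rng = np.random.default_rng(1)
D = normalize_columns(rng.standard_normal((16, 32)))  # illustrative dictionary
mu = mutual_coherence(D)
r = log_barrier_coherence(D)
N = D.shape[1] * (D.shape[1] - 1) // 2                # number of atom pairs
barrier_at_mu = -np.log1p(-mu ** 2)                   # largest single summand of r
```

Because every summand of `r` is non-negative and `barrier_at_mu` is the largest of them, it lies between the mean `r / N` and the sum `r`.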

Note that minimizing $r(D)$ implicitly influences $\mu(D)$. Concretely, the relation between (7) and the classical
mutual coherence (6) is

$$r(D) \ge -\ln\left(1 - (\mu(D))^2\right) \ge \frac{1}{N} r(D), \qquad (8)$$

with $N := d(d - 1)/2$ denoting the number of summands of (7). To see the validity of the above relation, note
that since the atoms are normalized to one, the inequality $0 \le |d_i^\top d_j|^2 \le 1$ holds due to the Cauchy-Schwarz
inequality. Thus, all summands $-\ln(1 - (d_i^\top d_j)^2)$ are non-negative. Moreover,

$$\max_{i < j} \left(-\ln(1 - (d_i^\top d_j)^2)\right) = -\ln\left(1 - (\mu(D))^2\right), \qquad (9)$$

and therefore

$$-N \ln\left(1 - (\mu(D))^2\right) \ge r(D) \ge -\ln\left(1 - (\mu(D))^2\right), \qquad (10)$$

which implies Equation (8). In order to exploit this relation for the separable case we first consider the following
Lemma.

Lemma 1. The mutual coherence of the Kronecker product of two matrices $A$ and $B$ with normalized columns
is equal to the maximum of the individual mutual coherences, i.e.

$$\mu(B \otimes A) = \max\{\mu(A), \mu(B)\}. \qquad (11)$$



Proof. First, notice that since the columns of $A$ and $B$ all have unit norm, the diagonal entries of both $A^\top A$
and $B^\top B$ are equal to one, and that the mutual coherences $\mu(A)$ and $\mu(B)$ are given by the largest off-diagonal absolute
values of $A^\top A$ and $B^\top B$, respectively. Analogously, $\mu(B \otimes A)$ is just the largest off-diagonal absolute value of
the matrix $(B \otimes A)^\top (B \otimes A) = (B^\top B) \otimes (A^\top A)$. Due to the definition of the Kronecker product and the unit
diagonal, each off-diagonal entry of $B^\top B$ and $A^\top A$ reappears among the off-diagonal entries of $(B \otimes A)^\top (B \otimes A)$. This yields
the two inequalities $\mu(B) \le \mu(B \otimes A)$ and $\mu(A) \le \mu(B \otimes A)$, which can be combined to

$$\max\{\mu(A), \mu(B)\} \le \mu(B \otimes A). \qquad (12)$$

On the other hand, each entry of $(B^\top B) \otimes (A^\top A)$ is a product of entries of $B^\top B$ and $A^\top A$. This explicitly
means that we can write $\mu(B \otimes A) = |ba|$ with $b$ and $a$ being entries of $B^\top B$ and $A^\top A$, respectively, at least one of
which is an off-diagonal entry. Since we have $0 \le |a|, |b| \le 1$, this provides the two inequalities $\mu(B \otimes A) \le |b|$ and
$\mu(B \otimes A) \le |a|$, and hence

$$\mu(B \otimes A) \le \max\{\mu(A), \mu(B)\}. \qquad (13)$$

Combining (12) and (13) provides the desired result. □

Substituting $\mu(B \otimes A)$ into Equation (8) and then applying Lemma 1 yields

$$\max\{r(B), r(A)\} \ge -\ln\left(1 - \mu(B \otimes A)^2\right) \ge \max\left\{\tfrac{1}{N} r(B), \tfrac{1}{N} r(A)\right\} \qquad (14)$$

due to the monotone behavior of the logarithm. Therefore, if $\max\{r(B), r(A)\}$ is small, $\mu(B \otimes A)$ is bounded
as well. Now, in order to keep the mutual coherence of $B \otimes A$ moderate, we use the relation

$$C_1 (r(B) + r(A)) \le \max\{r(B), r(A)\} \le C_2 (r(B) + r(A)), \qquad (15)$$

for some positive constants $C_1$, $C_2$ and minimize the sum $r(B) + r(A)$ instead of $\max\{r(B), r(A)\}$ for computational
convenience.
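Lemma 1 is easy to confirm numerically. The following minimal NumPy sketch uses small random factors with normalized columns; the sizes and helper names are illustrative assumptions.

```python
import numpy as np


def unit_columns(M):
    """Normalize every column of M to unit Euclidean norm."""
    return M / np.linalg.norm(M, axis=0, keepdims=True)


def mu(D):
    """Mutual coherence: largest off-diagonal absolute entry of the Gram matrix."""
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return G.max()


rng = np.random.default_rng(2)
A = unit_columns(rng.standard_normal((8, 16)))
B = unit_columns(rng.standard_normal((8, 12)))

mu_kron = mu(np.kron(B, A))      # coherence of the full separable dictionary
mu_max = max(mu(A), mu(B))       # maximum of the two factor coherences
```

In practice this means the coherence of the large dictionary can be monitored on the two small factors alone, without ever forming the Kronecker product.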

Finally, putting all the collected ingredients together, to learn a separable dictionary our goal is to minimize

$$f : \mathbb{R}^{a \times b \times m} \times S(h, a) \times S(w, b) \to \mathbb{R},$$
$$(\mathcal{X}, A, B) \mapsto \frac{\lambda}{2m} \sum_{j=1}^{m} \|A X_j B^\top - S_j\|_F^2 + \frac{1}{m} g(\mathcal{X}) + \kappa\, r(A) + \kappa\, r(B). \qquad (16)$$

Therein, $\lambda \in \mathbb{R}^+$ weighs between the sparsity of $\mathcal{X}$ and how accurately $A X_j B^\top$ reproduces the training samples.
Using this parameter, SeDiL can handle both perfect noise-free training data and noisy training data. The
second weighting factor $\kappa \in \mathbb{R}^+$ controls the mutual coherence of the learned dictionary.

3. Learning on Matrix Manifolds 

Knowing that the feasible set of solutions to Problem (16) is restricted to a smooth manifold allows us to
apply methods from the field of geometric optimization to learn the dictionary. To provide the necessary notation,
we briefly recall the required concepts of optimization on matrix manifolds. For an in-depth introduction to
optimization on matrix manifolds, we refer the interested reader to [1].

Let $M$ be a smooth Riemannian submanifold of some Euclidean space, and let $f : M \to \mathbb{R}$ be a differentiable
cost function. We consider the problem of finding

$$\arg\min_{\mathcal{Y} \in M} f(\mathcal{Y}). \qquad (17)$$

To every point $\mathcal{Y} \in M$ one can assign a tangent space $T_{\mathcal{Y}} M$, which is a real vector space containing all possible
directions that tangentially pass through $\mathcal{Y}$. An element $\Xi \in T_{\mathcal{Y}} M$ is called a tangent vector at $\mathcal{Y}$. Each tangent
space is associated with an inner product inherited from the surrounding Euclidean space which we denote by
$\langle \cdot, \cdot \rangle$ and the corresponding norm by $\|\cdot\|$. The Riemannian gradient of $f$ at $\mathcal{Y}$ is an element of the tangent space
$T_{\mathcal{Y}} M$ that points in the direction of steepest ascent of the cost function on the manifold. For the case where
$f$ is globally defined on the entire surrounding Euclidean space, the Riemannian gradient $\mathcal{G}(\mathcal{Y})$ is simply the
orthogonal projection of the (standard) gradient $\nabla f(\mathcal{Y})$ onto the tangent space $T_{\mathcal{Y}} M$, which reads as

$$\mathcal{G}(\mathcal{Y}) = \Pi_{T_{\mathcal{Y}} M}(\nabla f(\mathcal{Y})). \qquad (18)$$

A geodesic is a smooth curve $\Gamma_M(\mathcal{Y}, \Xi, t)$ emanating from $\mathcal{Y}$ in the direction of $\Xi \in T_{\mathcal{Y}} M$, which locally describes
the shortest path between two points on $M$. Intuitively, it can be interpreted as the generalization of a straight line
to a manifold. The Riemannian exponential mapping, which maps a point from the tangent space to the manifold,
is defined as

$$\exp_{\mathcal{Y}} : T_{\mathcal{Y}} M \to M, \quad \Xi \mapsto \Gamma_M(\mathcal{Y}, \Xi, 1). \qquad (19)$$

The geometric optimization method proposed in this work is based on iterating the following line search scheme.
Given the iterate $\mathcal{Y}^{(i)}$, a search direction $\mathcal{H}^{(i)} \in T_{\mathcal{Y}^{(i)}} M$, and the step size $\alpha^{(i)} \in \mathbb{R}$ at the $i$-th iteration, the new
iterate lying on $M$ is found via

$$\mathcal{Y}^{(i+1)} = \Gamma_M(\mathcal{Y}^{(i)}, \mathcal{H}^{(i)}, \alpha^{(i)}), \qquad (20)$$

i.e. following the geodesic emanating from $\mathcal{Y}^{(i)}$ in the search direction $\mathcal{H}^{(i)}$ for the length $\alpha^{(i)}$.

In the following, we concretize the above concepts for the situation at hand and present all ingredients that
are necessary to implement the proposed geometric dictionary learning method. The given formulas regarding
the geometry of $S(n, d)$ are derived e.g. in [ ]. Here we are considering the product manifold $M := \mathbb{R}^{a \times b \times m} \times S(h, a) \times S(w, b)$,
which is a Riemannian submanifold of $\mathbb{R}^{a \times b \times m} \times \mathbb{R}^{h \times a} \times \mathbb{R}^{w \times b}$, and an element of $M$ is denoted
by $\mathcal{Y} = (\mathcal{X}, A, B)$. The tangent space at $D \in S(n, d)$ is given by

$$T_D S(n, d) = \{\Xi \in \mathbb{R}^{n \times d} \mid \operatorname{ddiag}(D^\top \Xi) = 0\}, \qquad (21)$$

and the orthogonal projection of some matrix $Q \in \mathbb{R}^{n \times d}$ onto the tangent space reads as

$$\Pi_{T_D S(n, d)}(Q) = Q - D \operatorname{ddiag}(D^\top Q). \qquad (22)$$

Due to the product structure of $M$, the tangent space of $M$ at a point $\mathcal{Y} \in M$ is simply the product of all individual
tangent spaces, i.e. $T_{\mathcal{Y}} M := \mathbb{R}^{a \times b \times m} \times T_A S(h, a) \times T_B S(w, b)$. Consequently, in accordance with Equation
(22) the orthogonal projection of some arbitrary point $\mathcal{Q} = (\mathcal{Q}_1, Q_2, Q_3) \in \mathbb{R}^{a \times b \times m} \times \mathbb{R}^{h \times a} \times \mathbb{R}^{w \times b}$ onto the
tangent space $T_{\mathcal{Y}} M$ is

$$\Pi_{T_{\mathcal{Y}} M}(\mathcal{Q}) = (\mathcal{Q}_1, \Pi_{T_A S(h, a)}(Q_2), \Pi_{T_B S(w, b)}(Q_3)). \qquad (23)$$
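The tangent-space projection acts column by column, so it can be implemented without forming any $d \times d$ matrix. The following is a minimal NumPy sketch; the function names and the random test data are illustrative assumptions.

```python
import numpy as np


def project_tangent_sphere(D, Q):
    """Project Q onto the tangent space at D: Q - D ddiag(D^T Q).

    The diagonal of D^T Q holds the column-wise inner products <d_i, q_i>,
    so the product D ddiag(D^T Q) is a column-wise rescaling of D.
    """
    return Q - D * np.sum(D * Q, axis=0)


def project_tangent_product(Y, Q):
    """Projection on the product manifold; the coefficient part is unconstrained."""
    _, A, B = Y
    Q1, Q2, Q3 = Q
    return (Q1, project_tangent_sphere(A, Q2), project_tangent_sphere(B, Q3))


rng = np.random.default_rng(3)
D = rng.standard_normal((8, 16))
D /= np.linalg.norm(D, axis=0)           # point on the product of spheres
Q = rng.standard_normal((8, 16))         # arbitrary ambient matrix
P = project_tangent_sphere(D, Q)
```

Applied to the Euclidean gradient of the cost function, this projection gives the Riemannian gradient used by the method.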

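Each tangent space of $M$ is endowed with the Riemannian metric inherited from the surrounding Euclidean space,
which for two elements $\mathcal{R} = (\mathcal{R}_1, R_2, R_3)$ and $\mathcal{P} = (\mathcal{P}_1, P_2, P_3) \in T_{\mathcal{Y}} M$ is given by

$$\langle \mathcal{R}, \mathcal{P} \rangle := \sum_{j=1}^{m} \operatorname{tr}\left((R_{1,j})^\top P_{1,j}\right) + \operatorname{tr}(R_2^\top P_2) + \operatorname{tr}(R_3^\top P_3), \qquad (24)$$

where $R_{1,j}$ and $P_{1,j}$ denote the $j$-th $(a \times b)$ slices of $\mathcal{R}_1$ and $\mathcal{P}_1$.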

The final required ingredient is a way to compute geodesics. While in general there is no closed form solution
to the problem of finding a certain geodesic, the case at hand allows for an efficient implementation. Let $d \in S^{n-1}$
be a point on a sphere and $h \in T_d S^{n-1}$ be a tangent vector at $d$, then the geodesic in the direction of $h$ is a great
circle

$$\gamma(d, h, t) = \begin{cases} d, & \text{if } \|h\|_2 = 0, \\ d \cos(t \|h\|_2) + h \dfrac{\sin(t \|h\|_2)}{\|h\|_2}, & \text{otherwise.} \end{cases} \qquad (25)$$

Using this, the geodesic through $D \in S(n, d)$ in the direction of $H \in T_D S(n, d)$ is simply the combination of the
great circles emerging from each column of $D$ in the direction of the corresponding column of $H$, i.e.

$$\Gamma_{S(n, d)}(D, H, t) = [\gamma(d_1, h_1, t), \dots, \gamma(d_d, h_d, t)]. \qquad (26)$$

Now, let $\mathcal{H} = (\mathcal{H}_1, H_2, H_3) \in T_{\mathcal{Y}} M$ be a given search direction. Due to the product structure of $M$ a geodesic on
$M$ is given by

$$\Gamma_M(\mathcal{Y}, \mathcal{H}, t) = (\mathcal{X} + t \mathcal{H}_1, \Gamma_{S(h, a)}(A, H_2, t), \Gamma_{S(w, b)}(B, H_3, t)). \qquad (27)$$
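The great-circle formula vectorizes over the columns of the dictionary. The following is a minimal NumPy sketch of the geodesic on the product of spheres and on the product manifold; the function names and the demo data are illustrative assumptions.

```python
import numpy as np


def geodesic_sphere(D, H, t):
    """Column-wise great circles d cos(t|h|) + h sin(t|h|)/|h|.

    Columns with h = 0 stay where they are, since cos(0) = 1 and sin(0) = 0.
    """
    norms = np.linalg.norm(H, axis=0)
    safe = np.where(norms > 0, norms, 1.0)   # avoid division by zero
    return D * np.cos(t * norms) + (H / safe) * np.sin(t * norms)


def geodesic_product(Y, H, t):
    """Geodesic on the product manifold: a straight line in X, great circles in A and B."""
    X, A, B = Y
    H1, H2, H3 = H
    return (X + t * H1, geodesic_sphere(A, H2, t), geodesic_sphere(B, H3, t))


rng = np.random.default_rng(4)
D = rng.standard_normal((8, 16))
D /= np.linalg.norm(D, axis=0)
Q = rng.standard_normal((8, 16))
H = Q - D * np.sum(D * Q, axis=0)            # tangent direction at D
H[:, 0] = 0.0                                # one column without movement
E = geodesic_sphere(D, H, 0.7)
```

Since every column of `H` is orthogonal to the corresponding column of `D`, the columns of `E` keep unit norm, so no renormalization step is needed after an update.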

The shorthand notation $\mathcal{G}^{(i)} := \mathcal{G}(\mathcal{Y}^{(i)})$ will be used throughout the rest of this paper to denote the Riemannian
gradient at the $i$-th iterate.



4. Separable Dictionary Learning (SeDiL) 



To solve optimization problem (16), we employ a geometric conjugate gradient (CG) method, as it offers a superlinear
rate of convergence, while still being applicable to large scale optimization problems with acceptable
computational complexity. Therein, the initial search direction is equal to the negative Riemannian gradient, i.e.
$\mathcal{H}^{(0)} = -\mathcal{G}^{(0)}$. In the subsequent iterations, $\mathcal{H}^{(i+1)}$ is a linear combination of the gradient $\mathcal{G}^{(i+1)}$ and the previous
search direction $\mathcal{H}^{(i)}$. Since addition of vectors from different tangent spaces is not defined, we need to map $\mathcal{H}^{(i)}$
from $T_{\mathcal{Y}^{(i)}} M$ to $T_{\mathcal{Y}^{(i+1)}} M$. This is done by the so-called parallel transport $\mathcal{T}_M(\Xi, \mathcal{Y}^{(i)}, \mathcal{H}^{(i)}, \alpha^{(i)})$, which transports
a tangent vector $\Xi \in T_{\mathcal{Y}^{(i)}} M$ along the geodesic $\Gamma_M(\mathcal{Y}^{(i)}, \mathcal{H}^{(i)}, t)$ to the tangent space $T_{\mathcal{Y}^{(i+1)}} M$. Similar
to the way we derived a closed form solution for the geodesic, we consider the geometry of $S(n, d)$ at first. The
parallel transport of a tangent vector $\xi \in T_d S^{n-1}$ along the great circle $\gamma(d, h, t)$ is

$$\tau(\xi, d, h, t) = \xi - \frac{\xi^\top h}{\|h\|_2^2} \left( d \|h\|_2 \sin(t \|h\|_2) + h (1 - \cos(t \|h\|_2)) \right), \qquad (28)$$

and the parallel transport of $\Xi \in T_D S(n, d)$ along the geodesic $\Gamma_{S(n, d)}(D, H, t)$ is given by

$$\mathcal{T}_{S(n, d)}(\Xi, D, H, t) = [\tau(\xi_1, d_1, h_1, t), \dots, \tau(\xi_d, d_d, h_d, t)]. \qquad (29)$$

Thus, a tangent vector $\Xi = (\Xi_1, \Xi_2, \Xi_3) \in T_{\mathcal{Y}} M$ is transported in the direction of $\mathcal{H} \in T_{\mathcal{Y}} M$ via

$$\mathcal{T}_M(\Xi, \mathcal{Y}, \mathcal{H}, t) = (\Xi_1, \mathcal{T}_{S(h, a)}(\Xi_2, A, H_2, t), \mathcal{T}_{S(w, b)}(\Xi_3, B, H_3, t)). \qquad (30)$$

Now, using the shorthand notation $\mathcal{T}^{(i+1)}_{\Xi} := \mathcal{T}_M(\Xi, \mathcal{Y}^{(i)}, \mathcal{H}^{(i)}, \alpha^{(i)})$, the new search direction is computed by

$$\mathcal{H}^{(i+1)} = -\mathcal{G}^{(i+1)} + \beta^{(i)} \mathcal{T}^{(i+1)}_{\mathcal{H}^{(i)}}. \qquad (31)$$
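Parallel transport along great circles can be sketched column-wise in the same way as the geodesic. The NumPy code below is a minimal illustration; the function names and the test data are assumptions. Columns with a zero direction are left unchanged, which is how the case $\|h\|_2 = 0$ is handled here.

```python
import numpy as np


def great_circle(D, H, t):
    """Column-wise geodesic on the product of spheres (columns with h = 0 do not move)."""
    norms = np.linalg.norm(H, axis=0)
    safe = np.where(norms > 0, norms, 1.0)
    return D * np.cos(t * norms) + (H / safe) * np.sin(t * norms)


def transport_sphere(Xi, D, H, t):
    """Parallel transport of the columns of Xi along the great circles from D in direction H."""
    norms = np.linalg.norm(H, axis=0)
    safe = np.where(norms > 0, norms, 1.0)
    coeff = np.sum(Xi * H, axis=0) / safe ** 2          # xi^T h / ||h||^2 per column
    correction = D * (norms * np.sin(t * norms)) + H * (1.0 - np.cos(t * norms))
    return Xi - coeff * correction


rng = np.random.default_rng(5)
D = rng.standard_normal((8, 6))
D /= np.linalg.norm(D, axis=0)


def tangent(Q):
    """Project an ambient matrix onto the tangent space at D."""
    return Q - D * np.sum(D * Q, axis=0)


H = tangent(rng.standard_normal((8, 6)))    # direction of the geodesic
Xi = tangent(rng.standard_normal((8, 6)))   # tangent vector to be transported
t = 0.3
D_new = great_circle(D, H, t)
Xi_new = transport_sphere(Xi, D, H, t)
```

The transported columns are tangent at the new point and keep their length, which is what allows them to be added to the new gradient.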

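We update $\beta^{(i)}$ following the hybrid optimization scheme which is proposed in [ ] and has shown excellent performance
in practice. The authors combine the Hestenes-Stiefel (HS) and Dai-Yuan (DY) update formulas, which
are given by

$$\beta^{(i)}_{HS} = \frac{\langle \mathcal{G}^{(i+1)}, \mathcal{Z}^{(i+1)} \rangle}{\langle \mathcal{T}^{(i+1)}_{\mathcal{H}^{(i)}}, \mathcal{Z}^{(i+1)} \rangle}, \qquad \beta^{(i)}_{DY} = \frac{\langle \mathcal{G}^{(i+1)}, \mathcal{G}^{(i+1)} \rangle}{\langle \mathcal{T}^{(i+1)}_{\mathcal{H}^{(i)}}, \mathcal{Z}^{(i+1)} \rangle}, \qquad (32)$$

with $\mathcal{Z}^{(i+1)} := \mathcal{G}^{(i+1)} - \mathcal{T}^{(i+1)}_{\mathcal{G}^{(i)}}$, to create the hybrid update formula

$$\beta^{(i)}_{hyb} = \max\{0, \min\{\beta^{(i)}_{DY}, \beta^{(i)}_{HS}\}\}. \qquad (33)$$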
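The hybrid rule only needs inner products of already transported quantities. The sketch below is a minimal Python illustration that assumes the standard Hestenes-Stiefel and Dai-Yuan quotients; tangent vectors are represented as tuples of arrays. The function names and the fallback to $\beta = 0$ for a vanishing denominator are assumptions of this example.

```python
import numpy as np


def inner(R, P):
    """Euclidean inner product on the product space, summed over all components."""
    return sum(float(np.sum(r * p)) for r, p in zip(R, P))


def hybrid_beta(G_new, G_old_transported, H_transported):
    """Hybrid HS/DY coefficient max{0, min{beta_DY, beta_HS}}.

    G_new:              Riemannian gradient at the new iterate
    G_old_transported:  previous gradient, transported to the new tangent space
    H_transported:      previous search direction, transported likewise
    """
    Z = tuple(g - g_old for g, g_old in zip(G_new, G_old_transported))
    denom = inner(H_transported, Z)
    if denom == 0.0:
        return 0.0  # assumption: restart with the steepest descent direction
    beta_hs = inner(G_new, Z) / denom
    beta_dy = inner(G_new, G_new) / denom
    return max(0.0, min(beta_hs, beta_dy))


# Two tiny hand-checkable cases with a single component each.
beta_a = hybrid_beta((np.array([1.0, 0.0]),),
                     (np.array([0.0, 1.0]),),
                     (np.array([0.0, -1.0]),))
beta_b = hybrid_beta((np.array([1.0, 0.0]),),
                     (np.array([2.0, 0.0]),),
                     (np.array([-2.0, 0.0]),))
```

In the first case both quotients equal one; in the second the Hestenes-Stiefel quotient is negative, so the truncation at zero resets the search direction to the negative gradient.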

In order to find an appropriate step size $\alpha^{(i)}$, we propose a Riemannian adaptation of the nonmonotone line
search algorithm proposed in [ ]. Like other nonmonotone line search schemes it has the potential to improve
the likelihood of finding a global minimum as well as to increase the convergence speed, cf. [6]. In contrast
to the standard Armijo rule and standard nonmonotone schemes, which generally use the function value at the
previous iterate or the maximum of the previous $m$ iterates, this particular method utilizes a convex combination
of all function values at previous iterations. The pseudo code for a version of this line search scheme that is
adapted to our geometric optimization problem can be found in Algorithm 1. The line search is initialized with
$C^{(0)} = f(\mathcal{Y}^{(0)})$ and $Q^{(0)} = 1$. Finally, our complete method of learning a dictionary with separable structure is
summarized in Algorithm 2.



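Algorithm 1 Nonmonotone Line Search on $M$ in the $i$-th Iteration

Input: $t_0^{(i)} > 0$, $0 < c_1 < 1$, $0 < c_2 < 0.5$, $\mu > 0$, $0 \le \eta^{(i)} \le 1$, $Q^{(i)}$, $C^{(i)}$
Set: $t \leftarrow t_0^{(i)}$
while $f(\Gamma_M(\mathcal{Y}^{(i)}, \mathcal{H}^{(i)}, t)) > C^{(i)} + c_2 t \langle \mathcal{G}^{(i)}, \mathcal{H}^{(i)} \rangle$ do
    $t \leftarrow c_1 t$
end while
Set: $Q^{(i+1)} \leftarrow \eta^{(i)} Q^{(i)} + 1$,
     $C^{(i+1)} \leftarrow (\eta^{(i)} Q^{(i)} C^{(i)} + f(\Gamma_M(\mathcal{Y}^{(i)}, \mathcal{H}^{(i)}, t))) / Q^{(i+1)}$,
     $\alpha^{(i)} \leftarrow t$
Output: $\alpha^{(i)}$, $Q^{(i+1)}$, $C^{(i+1)}$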
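The line search can be written independently of the manifold by passing the curve along which to search as a function. The following Python sketch is a minimal illustration; the default parameter values, the `max_backtracks` safeguard and the function names are assumptions of this example, not settings from the experiments.

```python
import numpy as np


def nonmonotone_line_search(f, move, Y, H, slope, C, Q,
                            t0=1.0, c1=0.5, c2=1e-4, eta=0.85,
                            max_backtracks=50):
    """Backtracking against a weighted average C of past cost values.

    f:      cost function
    move:   move(Y, H, t) returns the point reached from Y along H with step t
            (a geodesic on a manifold, Y + t * H in Euclidean space)
    slope:  inner product <G, H> of gradient and search direction (negative for descent)
    C, Q:   reference value and weight carried over from the previous iteration
    """
    t = t0
    f_new = f(move(Y, H, t))
    for _ in range(max_backtracks):       # safeguard against endless backtracking (assumed)
        if f_new <= C + c2 * t * slope:
            break
        t *= c1
        f_new = f(move(Y, H, t))
    Q_new = eta * Q + 1.0
    C_new = (eta * Q * C + f_new) / Q_new  # convex combination of C and the new cost
    return t, Q_new, C_new


# Euclidean toy problem f(y) = 0.5 ||y||^2 with the steepest descent direction.
f = lambda y: 0.5 * float(y @ y)
move = lambda y, h, t: y + t * h
Y0 = np.array([3.0, 4.0])
G0 = Y0                                   # gradient of f at Y0
t1, Q1, C1 = nonmonotone_line_search(f, move, Y0, -G0, float(G0 @ -G0), f(Y0), 1.0)
t2, _, _ = nonmonotone_line_search(f, move, Y0, -G0, float(G0 @ -G0), f(Y0), 1.0, t0=4.0)
```

With `eta = 0` the reference value reduces to the most recent cost, which gives the usual monotone Armijo rule; larger values of `eta` let the accepted cost rise temporarily.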

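Algorithm 2 Separable Dictionary Learning (SeDiL)

Input: Initial dictionaries $A^{(0)} \in S(h, a)$, $B^{(0)} \in S(w, b)$, training data $\mathcal{S} \in \mathbb{R}^{h \times w \times m}$, parameters $\rho$, $\lambda$, $\kappa$, thresh
Set: $i \leftarrow 0$, $\mathcal{Y}^{(0)} \leftarrow (\{A^{(0)\top} S_j B^{(0)}\}_{j=1}^{m}, A^{(0)}, B^{(0)})$, $\mathcal{H}^{(0)} \leftarrow -\mathcal{G}^{(0)}$
repeat
    $\alpha^{(i)}$, $Q^{(i+1)}$, $C^{(i+1)}$ according to Algorithm 1 in conjunction with Equation (16)
    $\mathcal{Y}^{(i+1)} \leftarrow \Gamma_M(\mathcal{Y}^{(i)}, \mathcal{H}^{(i)}, \alpha^{(i)})$, cf. (27)
    $\mathcal{G}^{(i+1)} \leftarrow \Pi_{T_{\mathcal{Y}^{(i+1)}} M}(\nabla f(\mathcal{Y}^{(i+1)}))$, cf. (23)
    $\mathcal{H}^{(i+1)} \leftarrow -\mathcal{G}^{(i+1)} + \beta^{(i)}_{hyb} \mathcal{T}^{(i+1)}_{\mathcal{H}^{(i)}}$, cf. (31), (33)
    $i \leftarrow i + 1$
until $\|\mathcal{G}^{(i)}\| <$ thresh $\lor$ $i$ = maximum # iterations
Output: $A^\star \leftarrow A^{(i)}$, $B^\star \leftarrow B^{(i)}$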



5. Experiments 

To show how dictionaries learned via SeDiL perform in real applications, we present the results achieved
for denoising images corrupted by additive white Gaussian noise of different standard deviations $\sigma_{noise}$ as a case
study. The images and the noise levels chosen here are an excerpt of those commonly used in the literature.
The peak signal-to-noise ratio (PSNR) between the ground-truth image $\operatorname{vec}(S) \in \mathbb{R}^N$ and the recovered image
$\operatorname{vec}(S^\star) \in \mathbb{R}^N$, computed by $\mathrm{PSNR} = 10 \log_{10}\left(255^2 N / \sum_{i=1}^{N} (s_i - s_i^\star)^2\right)$, is used to quantify the reconstruction
quality. As an additional quality measure, we use the mean Structural SIMilarity Index (SSIM) computed with
the same set of parameters as originally suggested in [ ]. SSIM ranges between zero and one, with one meaning
perfect image reconstruction. Compared to PSNR, the SSIM better reflects the subjective visual impression of
quality.
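The PSNR definition above translates directly into code. The following is a minimal NumPy sketch; the function name and the handling of a zero error are assumptions of this example.

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """PSNR in dB: 10 log10(peak^2 * N / sum of squared pixel errors)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    squared_error = np.sum((reference - estimate) ** 2)
    if squared_error == 0:
        return float("inf")  # identical images (assumed convention)
    return 10.0 * np.log10(peak ** 2 * reference.size / squared_error)


clean = np.zeros((4, 4))
psnr_off_by_one = psnr(clean, clean + 1.0)      # every pixel differs by one gray level
psnr_worst_case = psnr(clean, clean + 255.0)    # every pixel differs by the full range
```

An error of one gray level per pixel corresponds to $20 \log_{10}(255) \approx 48.13$ dB, and an error of the full dynamic range in every pixel gives 0 dB.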

Here, we present the denoising performance of both a universal unstructured dictionary, i.e. $D_1 = 1 \otimes A$,
and a universal separable dictionary $D_2$, both learned from the same training data using SeDiL. By universal,
we mean that the dictionary is not specifically learned for a certain image class but universally applicable to
any image content. Without loss of generality we choose square image patches with $w = h = 8$, which is in
accordance with the patch sizes mostly used in the literature. For the unstructured dictionary we set $a = 4wh$,
and for the separable one we choose $a = b = 2w$, i.e. $A$ and $B$ are of equal size and $D_2 = B \otimes A$ is of the
same dimension as its unstructured counterpart. For the training phase, we extracted 40 000 image patches from
four images at random positions and vectorized them. Of course, these images are not considered further within
the performance evaluations. The training patches were normalized to have zero mean and unit $\ell_2$-norm. We
initialized $A$ and $B$ with random matrices with normalized columns. Global convergence to a local minimum has
always been observed, regardless of the initialization. The weighting parameters were empirically set to $\rho = 100$
and $\lambda = \kappa =$ . The resulting atoms of the unstructured dictionary $D_1$ and the separable dictionary $D_2 = B \otimes A$
are shown in Figure 1(a) and 1(b), respectively.

Figure 1. Learned atoms of (a) unstructured dictionary $D_1 = 1 \otimes A$ and (b) separable dictionary $D_2 = B \otimes A$ for a patch
size of $8 \times 8$. Each atom is shown as an $8 \times 8$ block where a black pixel corresponds to the smallest negative entry, gray is a
zero entry, and white corresponds to the largest positive entry.

To denoise the images, we first find the sparse representation $X_j^\star$ of each noisy patch $S_j$ over $A$, $B$ by solving

$$X_j^\star = \arg\min_{X_j \in \mathbb{R}^{a \times b}} \|X_j\|_1 + \lambda_d \|A X_j B^\top - S_j\|_F^2, \qquad (34)$$

employing the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [ ]. The regularization parameter $\lambda_d$
depends on the noise level and we set it to $\lambda_d =$ ^g^. After that, a clean image patch is computed from the sparse
coefficients via $S_j^\star = A X_j^\star B^\top$. Last, as all overlapping image patches are taken into account, several solutions for
the same pixel exist, and the final clean image is built by averaging all overlapping image patches. All achieved
results are given in Table 1.
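The sparse coding step can be sketched with a textbook FISTA iteration that only ever multiplies by the small factors $A$ and $B$. The code below is a minimal NumPy illustration and not the implementation used for the reported results; the function names, the zero initialization and the fixed iteration count are assumptions.

```python
import numpy as np


def soft_threshold(Z, tau):
    """Proximal operator of tau * ||.||_1 (entry-wise shrinkage)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)


def fista_separable(S, A, B, lam, n_iter=100):
    """Minimize ||X||_1 + lam * ||A X B^T - S||_F^2 over X with FISTA."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * lam * np.linalg.norm(A, 2) ** 2 * np.linalg.norm(B, 2) ** 2
    X = np.zeros((A.shape[1], B.shape[1]))
    Y, t = X.copy(), 1.0
    for _ in range(n_iter):
        grad = 2.0 * lam * A.T @ (A @ Y @ B.T - S) @ B   # gradient at the extrapolated point
        X_new = soft_threshold(Y - grad / L, 1.0 / L)    # proximal gradient step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Y = X_new + ((t - 1.0) / t_new) * (X_new - X)    # momentum / extrapolation
        X, t = X_new, t_new
    return X


# Sanity check with orthogonal factors, where the minimizer is known in closed form.
rng = np.random.default_rng(6)
A_orth, _ = np.linalg.qr(rng.standard_normal((8, 8)))
B_orth, _ = np.linalg.qr(rng.standard_normal((8, 8)))
S_patch = rng.standard_normal((8, 8))
lam_demo = 2.0
X_hat = fista_separable(S_patch, A_orth, B_orth, lam_demo, n_iter=20)
X_closed = soft_threshold(A_orth.T @ S_patch @ B_orth, 1.0 / (2.0 * lam_demo))
```

For orthogonal $A$ and $B$ the problem decouples entry-wise and the minimizer is a soft-thresholded version of $A^\top S B$, which is used here only to check the iteration.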

To compare and rank the learned dictionaries among existing state-of-the-art techniques, we present the denoising
performance of a universal dictionary $D_{KSVD}$ learned using K-SVD from the same training set as used for
SeDiL and of equal dimension as the unstructured dictionary $D_1$. From Table 1, it can be seen that employing $D_1$
always yields slightly better denoising results compared to employing $D_{KSVD}$. Employing the separable dictionary
$D_2$ leads to results that are slightly worse compared to employing the unstructured counterpart. This is the price
that has to be paid for its predefined structure. However, the separability allows a fast implementation just as the
popular and also separable Overcomplete Discrete Cosine Transform (ODCT). Here, it can be observed that the
separable dictionary $D_2$ learned by SeDiL outperforms the ODCT for most images, while requiring exactly the
same computational cost.

The second advantage besides computational efficiency that comes along with the capability of learning a
separable dictionary is that SeDiL allows learning sparse representations for image patches whose size lets other
unstructured dictionary learning methods fail due to numerical reasons. In order to demonstrate the capability
of SeDiL in this domain, a separable dictionary is learned from a training set consisting of 12 000 images of
dimension $(64 \times 64)$ showing frontal face views of different persons. These training images were randomly
extracted from the 13 228 faces of the "Cropped Labeled Faces in the Wild Database"$^1$ [12, 18]. The remaining
1228 images were used for the following inpainting experiments. Note that the face positions in the pictures are
arbitrary, see Figure 2 for five exemplary training faces. The dimensions of the resulting matrices $A$, $B$
were set to $(64 \times 128)$ and all other parameters required for the learning procedure were chosen as above.

Table 1. PSNR in dB and SSIM for denoising the five test images corrupted by five noise levels. Each cell presents the
results for the respective image and noise level for five different methods: top left FISTA+K-SVD dictionary, top right
FISTA+unstructured SeDiL, middle left FISTA+ODCT, middle right FISTA+separable SeDiL, bottom BM3D.

The ability of the separable dictionary to capture the global structure of the training samples is illustrated by
an inpainting experiment for face images of size $64 \times 64$, where large regions are missing. These images have of
course not been included in the training set. We assume that the image region that has to be filled in is given. The
inpainting procedure is again conducted by applying FISTA to the inverse problem

$$X^\star = \arg\min_{X \in \mathbb{R}^{a \times b}} \|X\|_1 + \lambda_d \|\mathrm{pr}(A X B^\top) - y\|_2^2, \qquad (35)$$

where the measurements $y \in \mathbb{R}^m$ are the available image data and $\mathrm{pr}(\cdot) : \mathbb{R}^{h \times w} \to \mathbb{R}^m$ is a projection onto the
corresponding region with available image data.
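For the inpainting problem, the only change compared to denoising is the masking operator inside the data term. The sketch below is a minimal NumPy illustration that represents $\mathrm{pr}(\cdot)$ by a boolean mask and gives the gradient of the data term, as needed by a proximal gradient method such as FISTA. The function names and the mask representation are assumptions of this example.

```python
import numpy as np


def pr(Z, mask):
    """Keep only the pixels where mask is True, returned as a vector."""
    return Z[mask]


def pr_adjoint(y, mask):
    """Adjoint of pr: put the measurements back into an image, zeros elsewhere."""
    Z = np.zeros(mask.shape)
    Z[mask] = y
    return Z


def inpainting_gradient(X, A, B, y, mask, lam):
    """Gradient of lam * ||pr(A X B^T) - y||_2^2 with respect to X."""
    residual = pr(A @ X @ B.T, mask) - y
    return 2.0 * lam * A.T @ pr_adjoint(residual, mask) @ B


rng = np.random.default_rng(7)
A = rng.standard_normal((6, 9))
B = rng.standard_normal((5, 7))
mask = rng.random((6, 5)) > 0.4              # True marks available pixels
y = rng.standard_normal(int(mask.sum()))     # available image data
X = rng.standard_normal((9, 7))
E = rng.standard_normal((9, 7))              # test direction for a gradient check

data_term = lambda X_: 1.5 * np.sum((pr(A @ X_ @ B.T, mask) - y) ** 2)
eps = 1e-5
finite_difference = (data_term(X + eps * E) - data_term(X - eps * E)) / (2 * eps)
directional = np.sum(inpainting_gradient(X, A, B, y, mask, 1.5) * E)
```

The finite-difference comparison at the end is only a consistency check of the gradient formula; the missing pixels never enter the residual, so they are filled in purely from the dictionary.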

An excerpt of the achieved results is given in Figure 3. We note that this experiment should not
be seen as a highly sophisticated face inpainting method, but rather as evidence that SeDiL is able to
properly extract the global information of the underlying training set.




Figure 2. Five exemplarily chosen training images. 

6. Conclusion 



We propose a new dictionary learning algorithm called SeDiL that is able to learn both unstructured dictionaries
and dictionaries with a separable structure. Employing a separable structure on dictionaries reduces the
computational complexity from $O(n)$ to $O(\sqrt{n})$ compared to employing unstructured dictionaries, with $n$ being
the considered signal dimension. Due to this, separable dictionaries can be learned using far larger signal dimensions
than those used for learning unstructured dictionaries, and they can be applied very efficiently
in image reconstruction tasks. Another advantage of SeDiL is that it allows controlling the mutual coherence of
the resulting dictionary. To this end, we introduce a new mutual coherence measure and put it in relation to the
classical mutual coherence. The SeDiL algorithm we propose is a geometric conjugate gradient algorithm that
exploits the underlying manifold structure. Numerical experiments for image denoising show the practicability of
our approach, while the ability to learn sparse representations of large image patches is demonstrated by a face
inpainting experiment.

Figure 3. Five exemplary large scale inpainting results. The first row shows the original images from which large regions are
removed in the second row. The last row shows the inpainting results achieved by SeDiL.

$^1$ http://itee.uq.edu.au/~conrad/lfwcrop/